The first analysis tests whether an increase in BinomFreq results in
a greater difference between the representation of the phrase and the
representations of its individual words. The interaction term with
RelFreq is included because relative frequency plausibly also plays an
important role. For example, if the binomial X and Y has a frequency of
3000 and the reversed form Y and X also has a frequency of 3000, the two
orders may not have separate representations. But if X and Y has a
frequency of 3000 while Y and X has a frequency of only 10, the two may
have drastically different representations. In other words, an increase
in binomial frequency may only produce a larger cosine difference when
the relative frequency is also higher.
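To make this concrete, here is a toy calculation using the hypothetical frequencies from the example above, assuming RelFreq is the alphabetical form’s share of the combined frequency (it is centered by subtracting 0.5 in the preprocessing below):

```r
# Hypothetical frequencies for "X and Y" (alphabetical) vs "Y and X"
freq_alpha = c(3000, 3000)
freq_nonalpha = c(3000, 10)

# Assumed definition: share of tokens occurring in the alphabetical order
rel_freq = freq_alpha / (freq_alpha + freq_nonalpha)
rel_freq - 0.5  # centered, as in the preprocessing below
# scenario 1: 0 (no order preference); scenario 2: ~0.497 (strong preference)
```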
The name Item may be a bit misleading: here, Item refers to a given
sentence context, which is the same for a given binomial regardless of
order. Thus the sentence context for “intents and purposes” is, for our
purposes, the same as the sentence context for “purposes and intents”.
The Item intercept is included because certain sentence contexts may
result in a higher or lower cosine difference.
OLMo-7B
cosine_data = read_csv('../Data/allenai_OLMo-1B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #drop zero-frequency binomials to make sure they aren't driving the effect
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
Model
options(contrasts = c("contr.sum","contr.sum"))
m1 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data,
family = gaussian(),
warmup = 10000,
iter = 20000,
cores = 4,
chains = 4,
control = list(max_treedepth = 15, adapt_delta = 0.95),
file = '../Data/model1')
fixef(m1)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.166701716 0.0031038051 0.16063683 0.172780299
## LogBinomFreq -0.003405644 0.0003729663 -0.00413395 -0.002673305
m2 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data_m2,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = '../Data/model2')
fixef(m2)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.172169223 0.0043851174 0.163619822 0.180752461
## LogBinomFreq -0.004083686 0.0005015375 -0.005070699 -0.003101475
conditional_effects(m1, ask = F)

conditional_effects(m2, ask = F)

Now let’s look at the relationship between the alphabetical and
nonalphabetical cosine values. To do this, we’ll compute a new variable:
the cosine value for the alphabetical order minus the cosine value for
the nonalphabetical order. In other words, this variable represents how
much more similar the alphabetical form is to its parts than the
nonalphabetical form is. A more positive value means the alphabetical
form is more similar to its pieces; a more negative value means the
nonalphabetical form is more similar to its pieces.
cosine_data_m3 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(cosine_diff = cosine_sim - first(cosine_sim)) %>%
group_by(Item) %>%
top_n(1, abs(cosine_diff)) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
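As a sanity check on the `first()` logic above: `arrange(desc(binom))` puts the nonalphabetical order first within each Item (it sorts later alphabetically), so `cosine_diff` is alphabetical minus nonalphabetical, and `top_n(1, abs(cosine_diff))` keeps the alphabetical row (the first row’s difference is always 0). A minimal sketch with made-up cosine values:

```r
library(dplyr)

# Toy data: one Item with both orders of a binomial (made-up values)
toy = tibble(
  Item = factor(c(1, 1)),
  binom = c("intents and purposes", "purposes and intents"),
  cosine_sim = c(0.80, 0.70)
)

toy %>%
  group_by(Item) %>%
  arrange(desc(binom), .by_group = TRUE) %>%           # nonalphabetical first
  mutate(cosine_diff = cosine_sim - first(cosine_sim)) %>%
  top_n(1, abs(cosine_diff))                           # keeps the alphabetical row
# cosine_diff = 0.80 - 0.70 = 0.10: here the alphabetical form is more
# similar to its parts than the nonalphabetical form
```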
One prediction is that the effect of relative frequency on this
difference will be larger for items with a large overall frequency.
RelFreq is now meaningful to use directly: a more positive RelFreq means
the binomial is more preferred in its alphabetical form, and a larger
cosine_diff means the alphabetical form is more similar to its parts.
The prediction, then, is that for items with a high overall frequency, a
larger relative frequency may result in a smaller cosine_diff (i.e., a
more obscure relationship between the meaning of the phrase and its
pieces), while a more negative relative frequency may result in a larger
cosine_diff.
For items with a small overall frequency, on the other hand, the
effect of relative frequency on cosine_diff may be negligible.
options(contrasts = c("contr.sum","contr.sum"))
m3 = brm(cosine_diff ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m3,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model3')
fixef(m3)
## Estimate Est.Error Q2.5
## Intercept 0.0003235332 0.0011490650 -0.0019690429
## CenteredLogOverallFreq -0.0001188927 0.0003826474 -0.0008642625
## RelFreq -0.0200271320 0.0038959598 -0.0277477342
## CenteredLogOverallFreq:RelFreq -0.0019538024 0.0008583814 -0.0036332824
## Q97.5
## Intercept 0.0025303600
## CenteredLogOverallFreq 0.0006154554
## RelFreq -0.0124108011
## CenteredLogOverallFreq:RelFreq -0.0002903793
conditional_effects(m3, ask = F)



post_samples_m3 = as.data.frame(fixef(m3, summary = F))
post_samples_OverallFreq = sum(post_samples_m3$CenteredLogOverallFreq < 0) / length(post_samples_m3$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m3$RelFreq > 0) / length(post_samples_m3$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m3$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m3$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.62025
print(post_samples_RelFreq)
## [1] 0
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.989125
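These proportions are simply the share of posterior draws falling on one side of zero, i.e. Monte Carlo estimates of directional posterior probabilities such as P(interaction < 0 | data). A minimal illustration on simulated draws (hypothetical numbers standing in for `fixef(m3, summary = F)`):

```r
set.seed(1)
# Pretend these are 8000 posterior draws of a coefficient
draws = rnorm(8000, mean = -0.002, sd = 0.001)

# Estimated posterior probability that the coefficient is negative;
# equivalent to sum(draws < 0) / length(draws)
mean(draws < 0)
```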
interact_plot(m3, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

What if, instead of the difference, we use the log of the ratio of the
two cosine values? (The code calls this log_odds_cosine, though strictly
it is a log ratio rather than a log odds ratio.)
cosine_data_m4 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 125: `Item = 126`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
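The NaN warning above comes from `log()`: presumably one of the two cosine values for Item 126 is negative, so the ratio is negative and has no real logarithm. A toy demonstration (made-up values):

```r
# A negative ratio has no real logarithm, so R returns NaN with a warning
log(-0.02 / 0.05)  # NaN: the two cosine values have opposite signs
log( 0.02 / 0.05)  # finite: same-sign cosine values
```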
options(contrasts = c("contr.sum","contr.sum"))
m4 = brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m4,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model4.1')
fixef(m4)
## Estimate Est.Error Q2.5
## Intercept 0.001390268 0.008530730 -0.014944736
## CenteredLogOverallFreq -0.001301627 0.002849741 -0.006950447
## RelFreq -0.151497017 0.028825973 -0.208381317
## CenteredLogOverallFreq:RelFreq -0.017054405 0.006542373 -0.029879392
## Q97.5
## Intercept 0.017917946
## CenteredLogOverallFreq 0.004214958
## RelFreq -0.094365072
## CenteredLogOverallFreq:RelFreq -0.003973157
conditional_effects(m4, ask = F)



post_samples_m4 = as.data.frame(fixef(m4, summary = F))
post_samples_OverallFreq = sum(post_samples_m4$CenteredLogOverallFreq < 0) / length(post_samples_m4$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m4$RelFreq > 0) / length(post_samples_m4$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m4$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m4$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.6755
print(post_samples_RelFreq)
## [1] 0
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.99425
interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', interval = T)

olmo7b_plot = conditional_effects(m4, plot = F, effects = "CenteredLogOverallFreq:RelFreq", int_conditions=list(RelFreq = c(-0.25, 0, 0.25)))
GPT 2
cosine_data = read_csv('../Data/gpt2_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 1572 Columns: 3
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (2): ...1, cosine_diffs
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #drop zero-frequency binomials to make sure they aren't driving the effect
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
Model
options(contrasts = c("contr.sum","contr.sum"))
m1 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data,
family = gaussian(),
warmup = 10000,
iter = 20000,
cores = 4,
chains = 4,
control = list(max_treedepth = 15, adapt_delta = 0.95),
file = '../Data/model1_gpt2')
fixef(m1)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.64015214 0.0025093056 0.635204111 0.645053791
## LogBinomFreq -0.00473829 0.0003090429 -0.005339843 -0.004126131
m2 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data_m2,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = '../Data/model2_gpt2')
fixef(m2)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.64693962 0.0036638002 0.639795827 0.654169755
## LogBinomFreq -0.00558006 0.0004273125 -0.006419095 -0.004775577
conditional_effects(m1, ask = F)

conditional_effects(m2, ask = F)

options(contrasts = c("contr.sum","contr.sum"))
cosine_data_m3 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(cosine_diff = cosine_sim - first(cosine_sim)) %>%
group_by(Item) %>%
top_n(1, abs(cosine_diff)) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
m3 = brm(cosine_diff ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m3,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model3_gpt2')
fixef(m3)
## Estimate Est.Error Q2.5
## Intercept 0.0007371647 0.0011257263 -0.001446338
## CenteredLogOverallFreq -0.0003964445 0.0003750113 -0.001140101
## RelFreq -0.0209867568 0.0039377924 -0.028884373
## CenteredLogOverallFreq:RelFreq -0.0029514384 0.0008604733 -0.004640186
## Q97.5
## Intercept 0.002947282
## CenteredLogOverallFreq 0.000314388
## RelFreq -0.013214139
## CenteredLogOverallFreq:RelFreq -0.001269423
conditional_effects(m3, ask = F)



post_samples_m3 = as.data.frame(fixef(m3, summary = F))
post_samples_OverallFreq = sum(post_samples_m3$CenteredLogOverallFreq < 0) / length(post_samples_m3$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m3$RelFreq > 0) / length(post_samples_m3$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m3$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m3$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.853375
print(post_samples_RelFreq)
## [1] 0
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.9995
interact_plot(m3, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

cosine_data_m4 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
options(contrasts = c("contr.sum","contr.sum"))
m4 = brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m4,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model4_gpt2')
fixef(m4)
## Estimate Est.Error Q2.5
## Intercept 0.0011552924 0.0019034895 -0.002522975
## CenteredLogOverallFreq -0.0006775682 0.0006319011 -0.001920465
## RelFreq -0.0350346543 0.0064110665 -0.047257276
## CenteredLogOverallFreq:RelFreq -0.0051057373 0.0014475828 -0.007910144
## Q97.5
## Intercept 0.0048501189
## CenteredLogOverallFreq 0.0005555468
## RelFreq -0.0223287319
## CenteredLogOverallFreq:RelFreq -0.0022799475
gpt2_plot = conditional_effects(m4, plot = F, effects = "CenteredLogOverallFreq:RelFreq", int_conditions=list(RelFreq = c(-0.25, 0, 0.25)), ask = F)
post_samples_m4 = as.data.frame(fixef(m4, summary = F))
post_samples_OverallFreq = sum(post_samples_m4$CenteredLogOverallFreq < 0) / length(post_samples_m4$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m4$RelFreq > 0) / length(post_samples_m4$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m4$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m4$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.85775
print(post_samples_RelFreq)
## [1] 0
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.999625
interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', interval = T)

GPT 2-XL
cosine_data = read_csv('../Data/gpt2-xl_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 1572 Columns: 3
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (2): ...1, cosine_diffs
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #drop zero-frequency binomials to make sure they aren't driving the effect
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
Model
options(contrasts = c("contr.sum","contr.sum"))
m1 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data,
family = gaussian(),
warmup = 10000,
iter = 20000,
cores = 4,
chains = 4,
control = list(max_treedepth = 15, adapt_delta = 0.95),
file = '../Data/model1_gpt')
fixef(m1)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.907282572 0.0036877281 0.900006730 0.914464088
## LogBinomFreq -0.002310599 0.0004493613 -0.003184477 -0.001431652
m2 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data_m2,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = '../Data/model2_gpt')
fixef(m2)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.905084594 0.0055914369 0.894096944 0.9160066606
## LogBinomFreq -0.002080987 0.0006492235 -0.003349974 -0.0008014259
conditional_effects(m1, ask = F)

conditional_effects(m2, ask = F)

options(contrasts = c("contr.sum","contr.sum"))
cosine_data_m3 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(cosine_diff = cosine_sim - first(cosine_sim)) %>%
group_by(Item) %>%
top_n(1, abs(cosine_diff)) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
m3 = brm(cosine_diff ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m3,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model3_gpt')
fixef(m3)
## Estimate Est.Error Q2.5
## Intercept -0.001456023 0.0015054355 -0.0043913498
## CenteredLogOverallFreq 0.001364609 0.0005015388 0.0003627335
## RelFreq 0.001642699 0.0050938782 -0.0085099740
## CenteredLogOverallFreq:RelFreq 0.002162254 0.0011460776 -0.0001136759
## Q97.5
## Intercept 0.001515622
## CenteredLogOverallFreq 0.002346755
## RelFreq 0.011658931
## CenteredLogOverallFreq:RelFreq 0.004378403
conditional_effects(m3, ask = F)



post_samples_m3 = as.data.frame(fixef(m3, summary = F))
post_samples_OverallFreq = sum(post_samples_m3$CenteredLogOverallFreq < 0) / length(post_samples_m3$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m3$RelFreq > 0) / length(post_samples_m3$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m3$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m3$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.004125
print(post_samples_RelFreq)
## [1] 0.630875
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative; here most mass is on a positive interaction
## [1] 0.0295
interact_plot(m3, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

cosine_data_m4 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
options(contrasts = c("contr.sum","contr.sum"))
m4 = brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m4,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model4_gpt')
fixef(m4)
## Estimate Est.Error Q2.5
## Intercept -0.001864345 0.0017912068 -5.405232e-03
## CenteredLogOverallFreq 0.001628159 0.0006090176 4.181959e-04
## RelFreq 0.003101389 0.0062885968 -9.040355e-03
## CenteredLogOverallFreq:RelFreq 0.002813284 0.0013997650 7.464377e-05
## Q97.5
## Intercept 0.001681689
## CenteredLogOverallFreq 0.002814514
## RelFreq 0.015316569
## CenteredLogOverallFreq:RelFreq 0.005513055
conditional_effects(m4, ask = F)



post_samples_m4 = as.data.frame(fixef(m4, summary = F))
post_samples_OverallFreq = sum(post_samples_m4$CenteredLogOverallFreq < 0) / length(post_samples_m4$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m4$RelFreq > 0) / length(post_samples_m4$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m4$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m4$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.00325
print(post_samples_RelFreq)
## [1] 0.6845
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative; here most mass is on a positive interaction
## [1] 0.02175
interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', interval = T)

gpt2xl_plot = conditional_effects(m4, plot = F, effects = "CenteredLogOverallFreq:RelFreq", int_conditions=list(RelFreq = c(-0.25, 0, 0.25)))
Llama 2-7b
cosine_data = read_csv('../Data/meta-llama_Llama-2-7b-hf_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 1572 Columns: 3
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (2): ...1, cosine_diffs
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #drop zero-frequency binomials to make sure they aren't driving the effect
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
Model
options(contrasts = c("contr.sum","contr.sum"))
m1 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data,
family = gaussian(),
warmup = 10000,
iter = 20000,
cores = 4,
chains = 4,
control = list(max_treedepth = 15, adapt_delta = 0.95),
file = '../Data/model1_llama')
fixef(m1)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.748943934 0.0051657258 0.738792429 0.75906956
## LogBinomFreq -0.003447564 0.0006170395 -0.004648682 -0.00224391
m2 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data_m2,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = '../Data/model2_llama')
fixef(m2)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.761572280 0.0075328524 0.746792172 0.776142111
## LogBinomFreq -0.004939076 0.0008679515 -0.006589139 -0.003205171
conditional_effects(m1, ask = F)

conditional_effects(m2, ask = F)

options(contrasts = c("contr.sum","contr.sum"))
cosine_data_m3 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(cosine_diff = cosine_sim - first(cosine_sim)) %>%
group_by(Item) %>%
top_n(1, abs(cosine_diff)) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
m3 = brm(cosine_diff ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m3,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model3_llama')
fixef(m3)
## Estimate Est.Error Q2.5
## Intercept -0.0025493679 0.0018660791 -0.006146530
## CenteredLogOverallFreq -0.0009997556 0.0006256822 -0.002240554
## RelFreq -0.0083357746 0.0063537112 -0.020618080
## CenteredLogOverallFreq:RelFreq -0.0024509120 0.0014310760 -0.005264157
## Q97.5
## Intercept 0.0012039028
## CenteredLogOverallFreq 0.0002385347
## RelFreq 0.0040826647
## CenteredLogOverallFreq:RelFreq 0.0003660968
conditional_effects(m3, ask = F)



post_samples_m3 = as.data.frame(fixef(m3, summary = F))
post_samples_OverallFreq = sum(post_samples_m3$CenteredLogOverallFreq < 0) / length(post_samples_m3$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m3$RelFreq > 0) / length(post_samples_m3$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m3$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m3$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.946125
print(post_samples_RelFreq)
## [1] 0.092875
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.957625
interact_plot(m3, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

cosine_data_m4 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
options(contrasts = c("contr.sum","contr.sum"))
m4 = brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m4,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model4_llama')
fixef(m4)
## Estimate Est.Error Q2.5
## Intercept -0.003639279 0.0027009905 -0.008943772
## CenteredLogOverallFreq -0.001594731 0.0008752394 -0.003303457
## RelFreq -0.012798011 0.0091534014 -0.030505741
## CenteredLogOverallFreq:RelFreq -0.003762644 0.0020283677 -0.007709659
## Q97.5
## Intercept 0.0016726595
## CenteredLogOverallFreq 0.0001249013
## RelFreq 0.0055621270
## CenteredLogOverallFreq:RelFreq 0.0001953563
conditional_effects(m4, ask = F)



post_samples_m4 = as.data.frame(fixef(m4, summary = F))
post_samples_OverallFreq = sum(post_samples_m4$CenteredLogOverallFreq < 0) / length(post_samples_m4$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m4$RelFreq > 0) / length(post_samples_m4$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m4$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m4$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.966
print(post_samples_RelFreq)
## [1] 0.080625
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.96825
interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', interval = T)

llama2_plot = conditional_effects(m4, plot = F, effects = "CenteredLogOverallFreq:RelFreq", int_conditions=list(RelFreq = c(-0.25, 0, 0.25)))
OLMo-1B
cosine_data = read_csv('../Data/allenai_OLMo-1B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #drop zero-frequency binomials to make sure they aren't driving the effect
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
Model
options(contrasts = c("contr.sum","contr.sum"))
m1 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data,
family = gaussian(),
warmup = 10000,
iter = 20000,
cores = 4,
chains = 4,
control = list(max_treedepth = 15, adapt_delta = 0.95),
file = '../Data/model1_olmo1b')
fixef(m1)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.166805872 0.0031090527 0.160745435 0.17291338
## LogBinomFreq -0.003419665 0.0003688395 -0.004155244 -0.00269542
m2 = brm(cosine_sim ~ LogBinomFreq + (1|Item),
data = cosine_data_m2,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = '../Data/model2_olmo1b')
fixef(m2)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 0.17231202 0.0044375012 0.163593080 0.181231376
## LogBinomFreq -0.00409542 0.0005111836 -0.005104293 -0.003077654
conditional_effects(m1, ask = F)

conditional_effects(m2, ask = F)

options(contrasts = c("contr.sum","contr.sum"))
cosine_data_m3 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(cosine_diff = cosine_sim - first(cosine_sim)) %>%
group_by(Item) %>%
top_n(1, abs(cosine_diff)) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
m3 = brm(cosine_diff ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m3,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model3_olmo1b')
fixef(m3)
## Estimate Est.Error Q2.5
## Intercept 0.0003471543 0.0011795469 -0.0019331332
## CenteredLogOverallFreq -0.0001194977 0.0003815781 -0.0008813918
## RelFreq -0.0199127186 0.0038767994 -0.0274208761
## CenteredLogOverallFreq:RelFreq -0.0019611811 0.0008967335 -0.0037341066
## Q97.5
## Intercept 0.0026280298
## CenteredLogOverallFreq 0.0006267294
## RelFreq -0.0122533122
## CenteredLogOverallFreq:RelFreq -0.0001936978
conditional_effects(m3, ask = F)



post_samples_m3 = as.data.frame(fixef(m3, summary = F))
post_samples_OverallFreq = sum(post_samples_m3$CenteredLogOverallFreq < 0) / length(post_samples_m3$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m3$RelFreq > 0) / length(post_samples_m3$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m3$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m3$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.624
print(post_samples_RelFreq)
## [1] 0
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.9855
interact_plot(m3, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

cosine_data_m4 = cosine_data %>%
group_by(Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 125: `Item = 126`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
options(contrasts = c("contr.sum","contr.sum"))
m4 = brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = cosine_data_m4,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
#init = 0.1,
file = '../Data/model4_olmo1b')
fixef(m4)
## Estimate Est.Error Q2.5
## Intercept 0.001378763 0.008477859 -0.015375136
## CenteredLogOverallFreq -0.001281864 0.002807376 -0.006864814
## RelFreq -0.151945542 0.028113398 -0.206736803
## CenteredLogOverallFreq:RelFreq -0.017055816 0.006459556 -0.029717299
## Q97.5
## Intercept 0.017862816
## CenteredLogOverallFreq 0.004152781
## RelFreq -0.098632816
## CenteredLogOverallFreq:RelFreq -0.004546817
conditional_effects(m4, ask = F)



post_samples_m4 = as.data.frame(fixef(m4, summary = F))
post_samples_OverallFreq = sum(post_samples_m4$CenteredLogOverallFreq < 0) / length(post_samples_m4$CenteredLogOverallFreq)
post_samples_RelFreq = sum(post_samples_m4$RelFreq > 0) / length(post_samples_m4$RelFreq)
post_samples_overallfreq_relfreq = sum(post_samples_m4$`CenteredLogOverallFreq:RelFreq` < 0) / length(post_samples_m4$`CenteredLogOverallFreq:RelFreq`)
print(post_samples_OverallFreq)
## [1] 0.67
print(post_samples_RelFreq)
## [1] 0
print(post_samples_overallfreq_relfreq) #posterior probability that the interaction is negative
## [1] 0.99575
interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', plot.points = T)

interact_plot(m4, pred = 'CenteredLogOverallFreq', modx = 'RelFreq', interval = T)

olmo1b_plot = conditional_effects(m4, plot = F, effects = "CenteredLogOverallFreq:RelFreq", int_conditions=list(RelFreq = c(-0.25, 0, 0.25)))
OLMo-1B at different layers
Main Model:
options(contrasts = c("contr.sum","contr.sum"))
cosine_data = read_csv('../Data/allenai_OLMo-1B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
#cosine_data$layer = factor(cosine_data$layer, levels = c('1', '3', '6', '14', '-2'))
cosine_data$layer = factor(cosine_data$layer)
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 3261: `layer = 4`, `Item = 126`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
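An aside on the "NaNs produced" warning above: `log_odds_cosine` is the log of a ratio of cosine differences, and those differences can presumably be negative for some layer/item pairs. When the two orderings of a binomial have differences of opposite sign, the ratio is negative, and `log()` of a negative number is NaN in R. A minimal illustration (the numbers are made up for demonstration):

```r
# Both differences positive: a well-defined log odds.
log(0.02 / 0.05)

# Differences of opposite sign: negative ratio, so log() returns NaN
# with a warning -- this is what dplyr is surfacing above.
log(-0.01 / 0.05)
```

These NaN rows are then silently ineligible for `top_n(1, abs(log_odds_cosine))`, so the affected items drop out of the per-layer data rather than crashing the models.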
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_main_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_main = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = 'main')
posterior_main = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = 'main')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_main_all_layers <- bind_rows(plot_data_list)
olmo_main_all_layers$layer = factor(olmo_main_all_layers$layer, levels = names(cosine_data_list))
olmo_main_all_layers = olmo_main_all_layers %>%
filter(layer != 16)
olmo_main_all_layers$RelFreq = factor(olmo_main_all_layers$RelFreq)
olmo_main_all_layers_plot = olmo_main_all_layers %>%
ggplot(aes(x=CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
geom_smooth(method='lm', formula=y~x, se=F) +
geom_ribbon(aes(ymin=lower__, ymax = upper__, fill = factor(RelFreq)), alpha = 0.5) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_wrap(~layer, ncol = 4) +
theme_bw() #+
#ggtitle('Checkpoint: main')
olmo_main_all_layers_plot

Step 425000 (1783B Tokens)
cosine_data = read_csv('../Data/allenai_OLMo-1B_step425000-tokens1783B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data$layer = factor(cosine_data$layer) #factor layer here too, matching the other checkpoints
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #These items might be driving the effect, let's make sure this isn't the case
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_step425000_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_425000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = '425000')
posterior_425000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = '425000')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_425000_all_layers <- bind_rows(plot_data_list)
olmo_425000_all_layers$layer = factor(olmo_425000_all_layers$layer, levels = names(cosine_data_list))
olmo_425000_all_layers = olmo_425000_all_layers %>%
filter(layer != 16)
olmo_425000_all_layers$RelFreq = factor(olmo_425000_all_layers$RelFreq)
olmo_425000_all_layers_plot = olmo_425000_all_layers %>%
ggplot(aes(x=CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
geom_smooth(method='lm', formula=y~x, se=F) +
geom_ribbon(aes(ymin=lower__, ymax = upper__, fill = factor(RelFreq)), alpha = 0.5) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_wrap(~layer, ncol = 4) +
theme_bw() #+
#ggtitle('Checkpoint: 425000')
olmo_425000_all_layers_plot

Step 100000 (419B Tokens)
cosine_data = read_csv('../Data/allenai_OLMo-1B_step100000-tokens419B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #These items might be driving the effect, let's make sure this isn't the case
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 149 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 8634: `layer = 11`, `Item = 10`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 148 remaining warnings.
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_step100000_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_100000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = '100000')
posterior_100000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = '100000')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_100000_all_layers <- bind_rows(plot_data_list)
olmo_100000_all_layers$layer = factor(olmo_100000_all_layers$layer, levels = names(cosine_data_list))
olmo_100000_all_layers = olmo_100000_all_layers %>%
filter(layer != 16)
olmo_100000_all_layers$RelFreq = factor(olmo_100000_all_layers$RelFreq)
olmo_100000_all_layers_plot = olmo_100000_all_layers %>%
ggplot(aes(x=CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
geom_smooth(method='lm', formula=y~x, se=F) +
geom_ribbon(aes(ymin=lower__, ymax = upper__, fill = factor(RelFreq)), alpha = 0.5) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_wrap(~layer, ncol = 4) +
theme_bw() #+
#ggtitle('Checkpoint: 100000')
olmo_100000_all_layers_plot

Step 50000 (210B Tokens)
cosine_data = read_csv('../Data/allenai_OLMo-1B_step50000-tokens210B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data$layer = factor(cosine_data$layer)
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #These items might be driving the effect, let's make sure this isn't the case
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 11 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 8063: `layer = 10`, `Item = 225`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 10 remaining warnings.
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_step50000_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_50000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = '50000')
posterior_50000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = '50000')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_50000_all_layers <- bind_rows(plot_data_list)
olmo_50000_all_layers$layer = factor(olmo_50000_all_layers$layer, levels = names(cosine_data_list))
olmo_50000_all_layers = olmo_50000_all_layers %>%
filter(layer != 16)
olmo_50000_all_layers$RelFreq = factor(olmo_50000_all_layers$RelFreq)
olmo_50000_all_layers_plot = olmo_50000_all_layers %>%
ggplot(aes(x=CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
geom_smooth(method='lm', formula=y~x, se=F) +
geom_ribbon(aes(ymin=lower__, ymax = upper__, fill = factor(RelFreq)), alpha = 0.5) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_wrap(~layer, ncol = 4) +
theme_bw() #+
#ggtitle('Checkpoint: 50000')
olmo_50000_all_layers_plot

Step 40000 (168B Tokens)
cosine_data = read_csv('../Data/allenai_OLMo-1B_step40000-tokens168B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data$layer = factor(cosine_data$layer)
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #These items might be driving the effect, let's make sure this isn't the case
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 57 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 8694: `layer = 11`, `Item = 70`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 56 remaining warnings.
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_step40000_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_40000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = '40000')
posterior_40000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = '40000')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_40000_all_layers <- bind_rows(plot_data_list)
olmo_40000_all_layers$layer = factor(olmo_40000_all_layers$layer, levels = names(cosine_data_list))
olmo_40000_all_layers = olmo_40000_all_layers %>%
filter(layer != 16)
olmo_40000_all_layers$RelFreq = factor(olmo_40000_all_layers$RelFreq)
olmo_40000_all_layers_plot = olmo_40000_all_layers %>%
ggplot(aes(x=CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
geom_smooth(method='lm', formula=y~x, se=F) +
geom_ribbon(aes(ymin=lower__, ymax = upper__, fill = factor(RelFreq)), alpha = 0.5) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_wrap(~layer, ncol = 4) +
theme_bw() #+
#ggtitle('Checkpoint: 40000')
olmo_40000_all_layers_plot

Step 30000 (126B Tokens)
cosine_data = read_csv('../Data/allenai_OLMo-1B_step30000-tokens126B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data$layer = factor(cosine_data$layer)
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #These items might be driving the effect, let's make sure this isn't the case
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 88 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 8630: `layer = 11`, `Item = 6`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 87 remaining warnings.
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_step30000_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_30000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = '30000')
posterior_30000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = '30000')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_30000_all_layers <- bind_rows(plot_data_list)
olmo_30000_all_layers$layer = factor(olmo_30000_all_layers$layer, levels = names(cosine_data_list))
olmo_30000_all_layers = olmo_30000_all_layers %>%
filter(layer != 16)
olmo_30000_all_layers$RelFreq = factor(olmo_30000_all_layers$RelFreq)
olmo_30000_all_layers_plot = olmo_30000_all_layers %>%
ggplot(aes(x=CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
geom_smooth(method='lm', formula=y~x, se=F) +
geom_ribbon(aes(ymin=lower__, ymax = upper__, fill = factor(RelFreq)), alpha = 0.5) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_wrap(~layer, ncol = 4) +
theme_bw() #+
#ggtitle('Checkpoint: 30000')
olmo_30000_all_layers_plot

Step 20000 (84B Tokens)
cosine_data = read_csv('../Data/allenai_OLMo-1B_step20000-tokens84B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial'))
## New names:
## Rows: 26724 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): binom dbl (3): ...1, cosine_diffs, layer
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
cosine_data$layer = factor(cosine_data$layer)
cosine_data = cosine_data %>%
mutate(Item = factor(Item)) %>%
mutate(LogBinomFreq = log(BinomFreq+1)) %>%
mutate(RelFreq = RelFreq - 0.5) %>% #centering RelFreq
filter(!Item %in% c(125, 176)) %>% #these two items were giving llama13 some trouble
rename('cosine_sim' = cosine_diffs)
cosine_data_m2 = cosine_data %>%
filter(LogBinomFreq > 0) #These items might be driving the effect, let's make sure this isn't the case
#test_na = cosine_data[is.na(cosine_data$cosine_diffs),]
cosine_data_m4 = cosine_data %>%
#filter(layer=='-2') %>%
group_by(layer, Item) %>%
arrange(desc(binom), .by_group = T) %>%
mutate(log_odds_cosine = log(cosine_sim/first(cosine_sim))) %>% #larger value means that the alphabetical form is more similar to its parts than the nonalphabetical
group_by(layer, Item) %>%
mutate(LogOverallFreq = log(OverallFreq+1)) %>%
top_n(1, abs(log_odds_cosine)) %>%
ungroup() %>%
mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 130 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 8634: `layer = 11`, `Item = 10`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 129 remaining warnings.
# Split the data into a named list of dataframes by layer
cosine_data_list = cosine_data_m4 %>%
split(.$layer)
# Optionally, assign each dataframe in the list to its own variable (not recommended for large numbers of variables)
list2env(
setNames(cosine_data_list, paste0("cosine_data_l", names(cosine_data_list))),
envir = .GlobalEnv
)
## <environment: R_GlobalEnv>
layer_numbers = gsub("cosine_data_l", "", names(cosine_data_list)) %>% as.integer()
# Define the model fitting function
fit_model = function(data, layer) {
brm(log_odds_cosine ~ CenteredLogOverallFreq * RelFreq,
data = data,
family = gaussian(),
warmup = 2000,
iter = 4000,
cores = 4,
chains = 4,
file = paste0('../Data/model4_olmo1b_step20000_l', layer))
}
# Apply the model fitting function to each dataframe in the list
models = map2(cosine_data_list, layer_numbers, fit_model)
fixed_effects_20000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ fixef(.x, summary = TRUE) %>% # Extract fixed effects for each model
as.data.frame() %>%
rownames_to_column("fixed-effect") %>% # Convert row names to a column
mutate(
term = str_replace(`fixed-effect`, "\\.\\.\\..*", ""), # Clean up row names
layer = .y # Add the layer number
)
) %>%
mutate(checkpoint = '20000')
posterior_20000 = map2_dfr(
models, # List of models
layer_numbers, # Corresponding layer numbers
~ {
# Extract posterior samples for the fixed effect
post_samples = as.data.frame(fixef(.x, summary = FALSE))
# Compute the proportion of samples where the term < 0
proportion = sum(post_samples$`CenteredLogOverallFreq:RelFreq` < 0) /
length(post_samples$`CenteredLogOverallFreq:RelFreq`)
# Return a dataframe with layer and proportion
tibble(
layer = .y, # Layer number
proportion = proportion # Computed proportion
)
}
) %>%
mutate(checkpoint = '20000')
generate_plot_data = function(model, layer) {
conditional_effects(model, plot = FALSE, effects = "CenteredLogOverallFreq:RelFreq", int_conditions = list(RelFreq = c(-0.25, 0, 0.25)))[[1]] %>%
mutate(layer = as.character(layer)) # Assign the layer as a string
}
# Apply the function iteratively to all models
plot_data_list = map2(models, layer_numbers, generate_plot_data)
# Combine all layers into one dataframe
olmo_20000_all_layers <- bind_rows(plot_data_list)
olmo_20000_all_layers$layer = factor(olmo_20000_all_layers$layer,
                                     levels = as.character(layer_numbers)) # levels must match the layer strings, not the list names, or every value becomes NA
olmo_20000_all_layers = olmo_20000_all_layers %>%
  filter(layer != 16)
olmo_20000_all_layers$RelFreq = factor(olmo_20000_all_layers$RelFreq)
olmo_20000_all_layers_plot = olmo_20000_all_layers %>%
  ggplot(aes(x = CenteredLogOverallFreq, y = estimate__, color = RelFreq)) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__, fill = RelFreq), alpha = 0.5) +
  ylab('Log Odds Cosine') +
  xlab('Centered Log Overall Frequency') +
  facet_wrap(~layer, ncol = 4) +
  theme_bw()
olmo_20000_all_layers_plot

Aggregate data
options(contrasts = c("contr.sum","contr.sum"))
cosine_data = read_csv('../Data/allenai_OLMo-1B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = 'main')
cosine_data_425000 = read_csv('../Data/allenai_OLMo-1B_step425000-tokens1783B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = '425000')
cosine_data_100000 = read_csv('../Data/allenai_OLMo-1B_step100000-tokens419B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = '100000')
cosine_data_50000 = read_csv('../Data/allenai_OLMo-1B_step50000-tokens210B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = '50000')
cosine_data_40000 = read_csv('../Data/allenai_OLMo-1B_step40000-tokens168B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = '40000')
cosine_data_30000 = read_csv('../Data/allenai_OLMo-1B_step30000-tokens126B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = '30000')
cosine_data_20000 = read_csv('../Data/allenai_OLMo-1B_step20000-tokens84B_compositional_cosine_diffs.csv') %>%
left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
mutate(checkpoint = '20000')
aggregate_data = cosine_data %>%
full_join(cosine_data_425000) %>%
full_join(cosine_data_100000) %>%
full_join(cosine_data_50000) %>%
full_join(cosine_data_40000) %>%
full_join(cosine_data_30000) %>%
full_join(cosine_data_20000)
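The seven read/join blocks above repeat one pattern and differ only in a filename suffix and a checkpoint label, so they could be driven from a lookup table. A hedged sketch (the suffixes are copied from the filenames above; `load_checkpoint` assumes readr/dplyr are attached, as elsewhere in this script, and the commented `map2_dfr` line would stand in for the `full_join` chain, which here simply stacks distinct rows):

```r
# Checkpoint label -> filename suffix (copied from the read_csv calls above).
suffixes = c(
  main     = '',
  `425000` = '_step425000-tokens1783B',
  `100000` = '_step100000-tokens419B',
  `50000`  = '_step50000-tokens210B',
  `40000`  = '_step40000-tokens168B',
  `30000`  = '_step30000-tokens126B',
  `20000`  = '_step20000-tokens84B'
)
paths = paste0('../Data/allenai_OLMo-1B', suffixes, '_compositional_cosine_diffs.csv')

# Read one checkpoint file and tag it (not run here; needs the CSVs on disk).
load_checkpoint = function(path, ckpt) {
  read_csv(path) %>%
    left_join(all_sentences, by = c('binom' = 'Binomial')) %>%
    mutate(checkpoint = ckpt)
}
# aggregate_data = map2_dfr(paths, names(suffixes), load_checkpoint)
```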
aggregate_data$layer = factor(aggregate_data$layer)
aggregate_data = aggregate_data %>%
  mutate(Item = factor(Item)) %>%
  mutate(LogBinomFreq = log(BinomFreq + 1)) %>%
  mutate(RelFreq = RelFreq - 0.5) %>% # centering RelFreq
  filter(!Item %in% c(125, 176)) %>% # these two items were giving llama13 some trouble
  rename('cosine_sim' = cosine_diffs)
aggregate_data = aggregate_data %>%
  group_by(checkpoint, layer, Item) %>%
  arrange(desc(binom), .by_group = TRUE) %>% # non-alphabetical ordering first, as the reference
  mutate(log_odds_cosine = log(cosine_sim / first(cosine_sim))) %>% # larger value means the alphabetical form is more similar to its parts than the non-alphabetical
  mutate(LogOverallFreq = log(OverallFreq + 1)) %>%
  top_n(1, abs(log_odds_cosine)) %>% # keep the non-reference row; the reference row is log(1) = 0
  ungroup() %>%
  mutate(CenteredLogOverallFreq = LogOverallFreq - mean(LogOverallFreq))
## Warning: There were 437 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `log_odds_cosine = log(cosine_sim/first(cosine_sim))`.
## ℹ In group 8634: `checkpoint = "100000"`, `layer = 11`, `Item = 10`.
## Caused by warning in `log()`:
## ! NaNs produced
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 436 remaining warnings.
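The log-odds step and the NaN warnings are easiest to see on a toy group. Each (checkpoint, layer, Item) group holds the two orderings of one binomial; after `arrange(desc(binom))` the non-alphabetical form comes first, so `first(cosine_sim)` makes it the reference row with a log odds of exactly 0, and `top_n(1, abs(log_odds_cosine))` keeps the other row. The warnings arise because cosine similarity can be negative, and a negative ratio has no real logarithm. A base-R miniature (the similarity values are invented for illustration):

```r
# Toy group: cosine similarities for the two orderings of one binomial.
cos_nonalpha = 0.62  # e.g. 'purposes and intents', sorted first by desc(binom)
cos_alpha    = 0.74  # e.g. 'intents and purposes'
log_odds = log(c(cos_nonalpha, cos_alpha) / cos_nonalpha)
# log_odds[1] is log(1) = 0, so top_n(1, abs(...)) keeps the other row
kept = which.max(abs(log_odds))

# Source of the NaN warnings: a negative similarity makes the ratio negative
bad = suppressWarnings(log(-0.10 / 0.62))  # NaN
```

When the positive-similarity assumption holds, the kept row's sign says which ordering sits closer to its parts; pairs with mixed-sign similarities produce the NaN rows that later get dropped by `ylim` and the plotting warnings below.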
Aggregate Model Results
fixefs_all = fixed_effects_main %>%
full_join(fixed_effects_425000) %>%
full_join(fixed_effects_100000) %>%
full_join(fixed_effects_50000) %>%
full_join(fixed_effects_40000) %>%
full_join(fixed_effects_30000) %>%
full_join(fixed_effects_20000)
fixefs_all$checkpoint = factor(fixefs_all$checkpoint, levels=c('20000', '30000', '40000', '50000', '100000', '425000', 'main'))
posterior_all = posterior_main %>%
full_join(posterior_425000) %>%
full_join(posterior_100000) %>%
full_join(posterior_50000) %>%
full_join(posterior_40000) %>%
full_join(posterior_30000) %>%
full_join(posterior_20000)
posterior_all$checkpoint = factor(posterior_all$checkpoint, levels=c('20000', '30000', '40000', '50000', '100000', '425000', 'main'))
Aggregate Plot
Plot of model predictions
require(grid) # for the textGrob() function
## Loading required package: grid
olmo_20000_all_layers_plot2 = olmo_20000_all_layers_plot +
theme(plot.margin=unit(c(0.5,0.5,0.5,1), 'cm'))
olmo_30000_all_layers_plot2 = olmo_30000_all_layers_plot +
theme(plot.margin=unit(c(0.5,0.5,0.5,1), 'cm'))
olmo_40000_all_layers_plot2 = olmo_40000_all_layers_plot +
theme(plot.margin=unit(c(0.5,0.5,0.5,1), 'cm'))
olmo_50000_all_layers_plot2 = olmo_50000_all_layers_plot +
theme(plot.margin=unit(c(0.5,0.5,0.5,1), 'cm'))
olmo_100000_all_layers_plot2 = olmo_100000_all_layers_plot +
theme(plot.margin=unit(c(0.5,0.5,0.5,1), 'cm'))
olmo_main_all_layers_plot2 = olmo_main_all_layers_plot +
theme(plot.margin=unit(c(0.5,0.5,0.5,1), 'cm'))
figure = ggarrange(
  olmo_20000_all_layers_plot2 + rremove("ylab") + rremove("xlab"),
  olmo_30000_all_layers_plot2 + rremove("ylab") + rremove("xlab"),
  olmo_40000_all_layers_plot2 + rremove("ylab") + rremove("xlab"),
  olmo_50000_all_layers_plot2 + rremove("ylab") + rremove("xlab"),
  olmo_100000_all_layers_plot2 + rremove("ylab") + rremove("xlab"),
  olmo_main_all_layers_plot2 + rremove("ylab") + rremove("xlab"), # remove axis labels from plots
  labels = c('20000', '30000', '40000', '50000', '100000', 'main'),
  ncol = 3, nrow = 2,
  common.legend = TRUE, legend = "bottom",
  align = "hv",
  font.label = list(size = 10, color = "black", face = "bold", family = NULL, position = "top"))
annotate_figure(figure,
                left = textGrob("Log Odds Cosine", rot = 90, vjust = 1, gp = gpar(cex = 1.3)),
                bottom = textGrob("Centered Log Overall Frequency", gp = gpar(cex = 1.3)))

Plot of actual data
aggregate_data$checkpoint = factor(aggregate_data$checkpoint, levels = c('20000', '30000', '40000', '50000', '100000', '425000', 'main'))
aggregate_data$layer = factor(aggregate_data$layer)
aggregate_data = aggregate_data %>%
  mutate(RelFreq_group = ifelse(RelFreq < 0, "nonalpha", "alpha")) # RelFreq was centered at 0.5, so the sign splits items by which ordering is the more frequent one
aggregate_data %>%
  ggplot(aes(x = CenteredLogOverallFreq, y = log_odds_cosine, colour = RelFreq_group)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, se = TRUE) +
  ylim(-2, 2) +
  ylab('Log Odds Cosine') +
  xlab('Centered Log Overall Frequency') +
  facet_grid(~checkpoint) +
  theme_bw()
## Warning: Removed 186 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 186 rows containing missing values or values outside the scale range
## (`geom_point()`).

aggregate_data_plot = ggplot(aggregate_data, aes(
x = CenteredLogOverallFreq,
y = log_odds_cosine,
color = RelFreq_group
)) +
geom_point(alpha = 0.2) +
geom_smooth(method = 'lm', formula = y ~ x, se = TRUE, linewidth = 1) +
ylab('Log Odds Cosine') +
xlab('Centered Log Overall Frequency') +
facet_grid(checkpoint ~ layer) +
theme_bw() +
scale_color_manual(
values = c("nonalpha" = "turquoise3", "alpha" = "deeppink2"),
name = "RelFreq Group"
) +
coord_cartesian(ylim = c(-0.5, 0.5))
aggregate_data_plot
